Previous studies have investigated student default on the Federal Direct Student Loan program, primarily focusing on undergraduate populations. However, this project aims to shift the focus to the analysis of graduate student default. This shift is motivated by the observation that the default rate among the graduate population at the college closely aligns with that of the undergraduate population.
Studies on undergraduate defaults have identified dropping out and race as key determinants. Yet, it remains uncertain whether these factors hold the same weight for graduate populations. Consequently, this analysis seeks to evaluate various classification algorithms for their accuracy in predicting students at risk of default.
Furthermore, we plan to employ classification algorithms in conjunction with the Gower clustering technique to extract insights and patterns from the subset of graduate borrowers who default. This approach aims to uncover nuanced factors contributing to default within the graduate student population.
R 4.2.1 and Python 3.9.13 are used for this analysis.
For a version with code, please go to the link below:
Graduate Student Default Analysis with Code
Below are references to articles on student loan default.
Who Are Student Loan Defaulters? (2017, December 14). Center for American Progress. https://www.americanprogress.org/article/student-loan-defaulters/
Author, B. (2021, July 7). Who Is More Likely to Default on Student Loans? Liberty Street Economics. https://libertystreeteconomics.newyorkfed.org/2017/11/who-is-more-likely-to-default-on-student-loans/
| Variable | Description |
|---|---|
| loan_status | Y = default, N = non-default |
| gender | M = Male, F = Female |
| citizen_status | Citizen, Eligible non-citizen |
| marital_status | |
| efc | Expected Family Contribution: an index determining the amount a student is capable of contributing to educational costs. The calculation includes income, assets, and family size. |
| School | SOE = School of Education, SOP = School of Psychology, SOM = School of Management, NEIB = New England Institute of Business |
| degr_cde | |
| degr_cde_school | Student's degree and school combined |
| exit_reason | G = Graduated, WD = Official withdrawal, unk = Unofficial withdrawal |
| Major | |
| local_hrs_attempt | Total credit hours attempted |
| local_hrs_earned | Total credit hours earned |
| yrs_to_pay_dte | Elapsed time between start date and payment start date |
| yrs_to_exit_dte | Elapsed time between start date and exit date |
| undergrad_loans_cc | Loans borrowed as an undergraduate student at the college |
| grad_loans_cc | Loans borrowed as a graduate student at the college |
| loans_not_cc | Loans borrowed at other colleges prior to starting at the college |
| NSLDS_loan_total | Total loans borrowed |
| age | |
We’ll remove undergraduates from the dataset.
| Name | dflt_grad |
| Number of rows | 13348 |
| Number of columns | 32 |
| _______________________ | |
| Column type frequency: | |
| character | 20 |
| numeric | 12 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| loan_status | 0 | 1.00 | 1 | 1 | 0 | 2 | 0 |
| pay_start_date | 0 | 1.00 | 9 | 9 | 0 | 6 | 0 |
| entry_dte | 2 | 1.00 | 2 | 18 | 0 | 182 | 0 |
| entrance_yr | 2 | 1.00 | 2 | 4 | 0 | 30 | 0 |
| exit_dte | 2 | 1.00 | 2 | 19 | 0 | 318 | 0 |
| exit_reason | 2 | 1.00 | 1 | 2 | 0 | 10 | 0 |
| loc_cde | 2 | 1.00 | 2 | 5 | 0 | 13 | 0 |
| div_cde | 2 | 1.00 | 2 | 2 | 0 | 14 | 0 |
| school | 28 | 1.00 | 2 | 3 | 0 | 6 | 0 |
| degr_cde | 2 | 1.00 | 2 | 5 | 0 | 21 | 0 |
| cip_desc | 83 | 0.99 | 2 | 60 | 0 | 33 | 0 |
| major_1 | 2 | 1.00 | 2 | 5 | 0 | 245 | 0 |
| major_minor_desc | 2 | 1.00 | 2 | 50 | 0 | 241 | 0 |
| value_description | 23 | 1.00 | 2 | 41 | 0 | 11 | 0 |
| marital_status | 1 | 1.00 | 1 | 9 | 0 | 8 | 0 |
| dob | 6 | 1.00 | 2 | 19 | 0 | 5501 | 0 |
| gender | 3 | 1.00 | 1 | 2 | 0 | 5 | 0 |
| city | 3 | 1.00 | 2 | 18 | 0 | 1420 | 0 |
| state | 4 | 1.00 | 2 | 2 | 0 | 44 | 0 |
| citizen_status | 1 | 1.00 | 1 | 20 | 0 | 4 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| career_hrs_attempt | 4 | 1.00 | 51.68 | 33.70 | 1 | 32 | 41 | 62.0 | 163 | ▅▇▁▂▁ |
| career_hrs_earned | 4 | 1.00 | 48.02 | 32.73 | 0 | 29 | 38 | 60.0 | 132 | ▃▇▂▁▂ |
| local_hrs_attempt | 4 | 1.00 | 41.44 | 22.97 | 1 | 28 | 38 | 53.0 | 163 | ▆▇▁▁▁ |
| local_hrs_earned | 4 | 1.00 | 37.78 | 21.88 | 0 | 24 | 36 | 48.0 | 124 | ▅▇▃▁▁ |
| xfer_hrs_earned | 4 | 1.00 | 10.24 | 22.44 | 0 | 0 | 0 | 3.0 | 90 | ▇▁▁▁▁ |
| total_income | 67 | 0.99 | 45899.29 | 140581.75 | -5844 | 19806 | 33399 | 51711.0 | 4520667 | ▇▁▁▁▁ |
| efc | 57 | 1.00 | 7142.12 | 33601.94 | 0 | 0 | 2457 | 9494.5 | 999999 | ▇▁▁▁▁ |
| life_sub_grad | 1 | 1.00 | 5285.48 | 8629.96 | 0 | 0 | 0 | 8500.0 | 57289 | ▇▂▁▁▁ |
| life_sub_ugrad | 1 | 1.00 | 3350.28 | 5914.47 | 0 | 0 | 0 | 4667.0 | 23000 | ▇▁▁▁▁ |
| life_unsub_grad | 1 | 1.00 | 17834.46 | 16473.61 | 0 | 0 | 15601 | 28680.0 | 91812 | ▇▅▂▁▁ |
| life_unsub_ugrad | 1 | 1.00 | 4013.02 | 7485.65 | 0 | 0 | 0 | 6000.0 | 48399 | ▇▁▁▁▁ |
| nslds_loan_total | 9 | 1.00 | 39057.61 | 27337.20 | 0 | 18875 | 33976 | 53500.0 | 161460 | ▇▆▂▁▁ |
Cleaning and preparation involves removing duplicate rows/columns, dropping or combining categories, renaming variables or categories, removing null entries, creating new variables, and formatting date features.
The date columns are of type character. These columns will be used to create new features and so need to be converted to a date format:

- pay start date
- entry date
- date of birth
- exit date
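As a sketch of the conversion on the pandas side (the frame name and sample values below are hypothetical; the column names follow the skim output above):

```python
import pandas as pd

# Toy frame standing in for the real dataset
dflt_gr = pd.DataFrame({
    "pay_start_date": ["9/30/2016", "9/30/2017"],
    "entry_dte": ["9/1/2012", "1/15/2013"],
    "dob": ["5/12/1978", "3/8/1984"],
    "exit_dte": ["6/6/2015", "12/20/2015"],
})

date_cols = ["pay_start_date", "entry_dte", "dob", "exit_dte"]
for col in date_cols:
    # errors="coerce" turns unparseable strings into NaT instead of raising
    dflt_gr[col] = pd.to_datetime(dflt_gr[col], errors="coerce")

print(dflt_gr.dtypes)
```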
| Name | dflt_grad |
| Number of rows | 9889 |
| Number of columns | 32 |
| _______________________ | |
| Column type frequency: | |
| character | 18 |
| numeric | 10 |
| POSIXct | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| loan_status | 0 | 1.00 | 1 | 1 | 0 | 2 | 0 |
| entrance_yr | 0 | 1.00 | 2 | 4 | 0 | 23 | 0 |
| exit_reason | 0 | 1.00 | 1 | 2 | 0 | 8 | 0 |
| loc_cde | 0 | 1.00 | 2 | 5 | 0 | 12 | 0 |
| div_cde | 0 | 1.00 | 2 | 2 | 0 | 6 | 0 |
| school | 0 | 1.00 | 2 | 3 | 0 | 5 | 0 |
| degr_cde | 0 | 1.00 | 2 | 5 | 0 | 14 | 0 |
| cip_desc | 64 | 0.99 | 2 | 60 | 0 | 26 | 0 |
| major_1 | 0 | 1.00 | 2 | 5 | 0 | 230 | 0 |
| major_minor_desc | 0 | 1.00 | 2 | 50 | 0 | 227 | 0 |
| value_description | 11 | 1.00 | 2 | 41 | 0 | 11 | 0 |
| total_income | 32 | 1.00 | 1 | 7 | 0 | 4616 | 0 |
| marital_status | 0 | 1.00 | 1 | 9 | 0 | 8 | 0 |
| efc | 32 | 1.00 | 1 | 6 | 0 | 3237 | 0 |
| gender | 1 | 1.00 | 1 | 2 | 0 | 5 | 0 |
| city | 1 | 1.00 | 2 | 18 | 0 | 1300 | 0 |
| state | 2 | 1.00 | 2 | 2 | 0 | 43 | 0 |
| citizen_status | 0 | 1.00 | 1 | 20 | 0 | 4 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| career_hrs_attempt | 2 | 1 | 40.74 | 20.70 | 1 | 32 | 38 | 49 | 147 | ▃▇▂▁▁ |
| career_hrs_earned | 2 | 1 | 37.87 | 19.83 | 0 | 28 | 36 | 48 | 132 | ▃▇▂▁▁ |
| local_hrs_attempt | 2 | 1 | 38.89 | 17.68 | 1 | 30 | 38 | 48 | 135 | ▃▇▂▁▁ |
| local_hrs_earned | 2 | 1 | 36.02 | 16.95 | 0 | 27 | 36 | 45 | 120 | ▃▇▂▁▁ |
| xfer_hrs_earned | 2 | 1 | 1.85 | 8.86 | 0 | 0 | 0 | 0 | 90 | ▇▁▁▁▁ |
| life_sub_grad | 0 | 1 | 7042.69 | 9321.41 | 0 | 0 | 0 | 13925 | 57289 | ▇▂▁▁▁ |
| life_sub_ugrad | 0 | 1 | 1428.96 | 4259.21 | 0 | 0 | 0 | 0 | 23000 | ▇▁▁▁▁ |
| life_unsub_grad | 0 | 1 | 23627.13 | 14879.58 | 0 | 12425 | 21667 | 32500 | 91812 | ▇▇▃▁▁ |
| life_unsub_ugrad | 0 | 1 | 1705.01 | 5279.06 | 0 | 0 | 0 | 0 | 41619 | ▇▁▁▁▁ |
| nslds_loan_total | 8 | 1 | 44859.55 | 27759.78 | 0 | 24780 | 40184 | 60083 | 161460 | ▇▇▃▁▁ |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| pay_start_date | 0 | 1 | 2010-09-30 20:00:00 | 2018-09-29 20:00:00 | 2016-09-29 20:00:00 | 6 |
| entry_dte | 2 | 1 | 1994-09-25 20:00:00 | 2023-08-31 20:00:00 | 2012-09-12 20:00:00 | 104 |
| exit_dte | 2 | 1 | 2005-02-21 19:00:00 | 2023-08-30 20:00:00 | 2015-06-06 20:00:00 | 225 |
| dob | 18 | 1 | 1934-03-08 19:00:00 | 1994-09-17 20:00:00 | 1978-05-12 20:00:00 | 4067 |
Based on the output from the skim() function after the data transformation, we know that there are now 15 categorical features of type character. Categorical features must be of type factor for use in classification models, so these features will be transformed to type factor.
Figure 1
Table 1 displays the data set as having factor, numeric, and POSIXct (date) features.
| exit_reason | loan_status | div_cde |
|---|---|---|
| WD: 1377 | Y: 575 | CA: 831 |
| G: 7145 | N: 9314 | CG: 65 |
| NA: 2 | | DB: 19 |
| UK: 1365 | | DL: 66 |
| | | GR: 8906 |
| | | NA: 2 |
Create age
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 9889 entries, 0 to 9888
## Data columns (total 32 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 loan_status 9889 non-null category
## 1 pay_start_date 9889 non-null datetime64[ns]
## 2 entry_dte 9887 non-null datetime64[ns]
## 3 entrance_yr 9889 non-null category
## 4 exit_dte 9887 non-null datetime64[ns]
## 5 exit_reason 9889 non-null category
## 6 loc_cde 9889 non-null category
## 7 div_cde 9889 non-null category
## 8 school 9889 non-null category
## 9 degr_cde 9889 non-null category
## 10 cip_desc 9889 non-null category
## 11 major_1 9889 non-null category
## 12 major_minor_desc 9889 non-null category
## 13 career_hrs_attempt 9887 non-null float64
## 14 career_hrs_earned 9887 non-null float64
## 15 local_hrs_attempt 9887 non-null float64
## 16 local_hrs_earned 9887 non-null float64
## 17 xfer_hrs_earned 9887 non-null float64
## 18 value_description 9889 non-null category
## 19 total_income 9857 non-null category
## 20 marital_status 9889 non-null category
## 21 efc 9857 non-null category
## 22 dob 9871 non-null datetime64[ns]
## 23 gender 9889 non-null category
## 24 life_sub_grad 9889 non-null float64
## 25 life_sub_ugrad 9889 non-null float64
## 26 life_unsub_grad 9889 non-null float64
## 27 life_unsub_ugrad 9889 non-null float64
## 28 city 9889 non-null category
## 29 state 9889 non-null category
## 30 citizen_status 9889 non-null category
## 31 nslds_loan_total 9881 non-null float64
## dtypes: category(18), datetime64[ns](4), float64(10)
## memory usage: 1.7 MB
## 0 46.931319
## 1 38.596154
## 2 30.184066
## 3 43.615385
## 4 41.332418
## Name: age, dtype: float64
## 0 47.0
## 1 39.0
## 2 30.0
## 3 44.0
## 4 41.0
## Name: age, dtype: float64
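The age output above can be approximated with a sketch like the following (toy dates and a hypothetical frame name; dividing elapsed days by 365.25 and rounding matches the magnitudes shown):

```python
import pandas as pd

# Hypothetical two-row frame; the report's frame is dflt_gr_3
df = pd.DataFrame({
    "pay_start_date": pd.to_datetime(["2016-09-30", "2017-09-30"]),
    "dob": pd.to_datetime(["1969-10-20", "1978-12-25"]),
})

# Age in fractional years at repayment start, then rounded as above
df["age"] = (df["pay_start_date"] - df["dob"]).dt.days / 365.25
df["age"] = df["age"].round(0)
print(df["age"])
```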
Create Exit time frame
Create payment date timeframe
## 0 3.304945
## 1 5.337912
## 2 1.299451
## 3 0.664835
## 4 2.304945
## Name: yrs_to_pay_dt, dtype: float64
dflt_gr_3["yrs_to_pay_dt"] = dflt_gr_3["yrs_to_pay_dt"].round(2)
## 0 3.30
## 1 5.34
## 2 1.30
## 3 0.66
## 4 2.30
## Name: yrs_to_pay_dt, dtype: float64
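The timeframe features follow the same elapsed-years pattern; a sketch with hypothetical dates (the exact columns subtracted are assumptions based on the data dictionary):

```python
import pandas as pd

# Hypothetical rows; the report creates yrs_to_pay_dt on dflt_gr_3
df = pd.DataFrame({
    "entry_dte": pd.to_datetime(["2013-06-10", "2011-05-28"]),
    "pay_start_date": pd.to_datetime(["2016-09-30", "2016-09-30"]),
})

# Elapsed years from entry to the start of repayment, rounded to two decimals
df["yrs_to_pay_dt"] = ((df["pay_start_date"] - df["entry_dte"]).dt.days / 365.25).round(2)
print(df["yrs_to_pay_dt"])
```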
## career_hrs_attempt career_hrs_earned local_hrs_attempt local_hrs_earned
## Min. : 1.00 Min. : 0.00 Min. : 1.00 Min. : 0.00
## 1st Qu.: 32.00 1st Qu.: 28.00 1st Qu.: 30.00 1st Qu.: 27.00
## Median : 38.00 Median : 36.00 Median : 38.00 Median : 36.00
## Mean : 40.74 Mean : 37.87 Mean : 38.88 Mean : 36.02
## 3rd Qu.: 49.00 3rd Qu.: 48.00 3rd Qu.: 48.00 3rd Qu.: 45.00
## Max. :147.00 Max. :132.00 Max. :135.00 Max. :120.00
## NA's :2 NA's :2 NA's :2 NA's :2
## xfer_hrs_earned life_sub_grad life_sub_ugrad life_unsub_grad
## Min. : 0.000 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 0.000 1st Qu.: 0 1st Qu.: 0 1st Qu.:12425
## Median : 0.000 Median : 0 Median : 0 Median :21667
## Mean : 1.854 Mean : 7043 Mean : 1429 Mean :23627
## 3rd Qu.: 0.000 3rd Qu.:13925 3rd Qu.: 0 3rd Qu.:32500
## Max. :90.000 Max. :57289 Max. :23000 Max. :91812
## NA's :2
## life_unsub_ugrad nslds_loan_total age yrs_to_exit_dt
## Min. : 0 Min. : 0 Min. :22.0 Min. : 0.070
## 1st Qu.: 0 1st Qu.: 24780 1st Qu.:31.0 1st Qu.: 1.340
## Median : 0 Median : 40184 Median :37.0 Median : 1.960
## Mean : 1705 Mean : 44860 Mean :38.8 Mean : 2.427
## 3rd Qu.: 0 3rd Qu.: 60083 3rd Qu.:46.0 3rd Qu.: 2.890
## Max. :41619 Max. :161460 Max. :76.0 Max. :21.700
## NA's :8 NA's :18 NA's :2
## yrs_to_pay_dt undergrad_loans_cc grad_loans_cc total_loans_cc
## Min. : 0.070 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 1.270 1st Qu.: 0 1st Qu.: 19278 1st Qu.: 20500
## Median : 1.950 Median : 0 Median : 28735 Median : 30500
## Mean : 2.412 Mean : 3134 Mean : 30670 Mean : 33804
## 3rd Qu.: 2.760 3rd Qu.: 0 3rd Qu.: 39004 3rd Qu.: 42654
## Max. :21.700 Max. :57500 Max. :116599 Max. :138500
## NA's :2
## loans_not_cc
## Min. : 0
## 1st Qu.: 0
## Median : 1666
## Mean : 12811
## 3rd Qu.: 19574
## Max. :138500
## NA's :8
## career_hrs_attempt career_hrs_earned local_hrs_attempt local_hrs_earned
## Min. : 1.00 Min. : 0.00 Min. : 1.00 Min. : 0.00
## 1st Qu.: 32.00 1st Qu.: 28.00 1st Qu.: 30.00 1st Qu.: 27.00
## Median : 38.00 Median : 36.00 Median : 38.00 Median : 36.00
## Mean : 40.74 Mean : 37.87 Mean : 38.88 Mean : 36.02
## 3rd Qu.: 49.00 3rd Qu.: 48.00 3rd Qu.: 48.00 3rd Qu.: 45.00
## Max. :147.00 Max. :132.00 Max. :135.00 Max. :120.00
## NA's :2 NA's :2 NA's :2 NA's :2
## xfer_hrs_earned life_sub_grad life_sub_ugrad life_unsub_grad
## Min. : 0.000 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 0.000 1st Qu.: 0 1st Qu.: 0 1st Qu.:12425
## Median : 0.000 Median : 0 Median : 0 Median :21667
## Mean : 1.854 Mean : 7043 Mean : 1429 Mean :23627
## 3rd Qu.: 0.000 3rd Qu.:13925 3rd Qu.: 0 3rd Qu.:32500
## Max. :90.000 Max. :57289 Max. :23000 Max. :91812
## NA's :2
## life_unsub_ugrad nslds_loan_total age yrs_to_exit_dt
## Min. : 0 Min. : 0 Min. :22.0 Min. : 0.070
## 1st Qu.: 0 1st Qu.: 24950 1st Qu.:31.0 1st Qu.: 1.340
## Median : 0 Median : 40375 Median :37.0 Median : 1.960
## Mean : 1705 Mean : 45159 Mean :38.8 Mean : 2.427
## 3rd Qu.: 0 3rd Qu.: 60078 3rd Qu.:46.0 3rd Qu.: 2.890
## Max. :41619 Max. :161460 Max. :76.0 Max. :21.700
## NA's :18 NA's :2
## yrs_to_pay_dt undergrad_loans_cc grad_loans_cc total_loans_cc
## Min. : 0.070 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 1.270 1st Qu.: 0 1st Qu.: 19278 1st Qu.: 20500
## Median : 1.950 Median : 0 Median : 28735 Median : 30500
## Mean : 2.412 Mean : 3134 Mean : 30670 Mean : 33804
## 3rd Qu.: 2.760 3rd Qu.: 0 3rd Qu.: 39004 3rd Qu.: 42654
## Max. :21.700 Max. :57500 Max. :116599 Max. :138500
## NA's :2
## loans_not_cc
## Min. : 0
## 1st Qu.: 0
## Median : 1666
## Mean : 12811
## 3rd Qu.: 19574
## Max. :138500
## NA's :8
## loans_not_cc
## Min. : 0
## 1st Qu.: 0
## Median : 1636
## Mean : 12801
## 3rd Qu.: 19574
## Max. :138500
## career_hrs_attempt career_hrs_earned local_hrs_attempt local_hrs_earned
## Min. : 1.00 Min. : 0.00 Min. : 1.00 Min. : 0.00
## 1st Qu.: 32.00 1st Qu.: 28.00 1st Qu.: 30.00 1st Qu.: 27.00
## Median : 38.00 Median : 36.00 Median : 38.00 Median : 36.00
## Mean : 40.74 Mean : 37.87 Mean : 38.89 Mean : 36.02
## 3rd Qu.: 49.00 3rd Qu.: 48.00 3rd Qu.: 48.00 3rd Qu.: 45.00
## Max. :147.00 Max. :132.00 Max. :135.00 Max. :120.00
## NA's :2 NA's :2 NA's :2 NA's :2
## xfer_hrs_earned life_sub_grad life_sub_ugrad life_unsub_grad
## Min. : 0.000 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 0.000 1st Qu.: 0 1st Qu.: 0 1st Qu.:12450
## Median : 0.000 Median : 0 Median : 0 Median :21667
## Mean : 1.854 Mean : 7044 Mean : 1429 Mean :23632
## 3rd Qu.: 0.000 3rd Qu.:13925 3rd Qu.: 0 3rd Qu.:32500
## Max. :90.000 Max. :57289 Max. :23000 Max. :91812
## NA's :2
## life_unsub_ugrad nslds_loan_total age yrs_to_exit_dt
## Min. : 0 Min. : 361 Min. :22.0 Min. : 0.070
## 1st Qu.: 0 1st Qu.: 24950 1st Qu.:31.0 1st Qu.: 1.340
## Median : 0 Median : 40376 Median :37.0 Median : 1.960
## Mean : 1705 Mean : 45168 Mean :38.8 Mean : 2.427
## 3rd Qu.: 0 3rd Qu.: 60080 3rd Qu.:46.0 3rd Qu.: 2.890
## Max. :41619 Max. :161460 Max. :76.0 Max. :21.700
## NA's :18 NA's :2
## yrs_to_pay_dt undergrad_loans_cc grad_loans_cc total_loans_cc
## Min. : 0.070 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 1.270 1st Qu.: 0 1st Qu.: 19300 1st Qu.: 20500
## Median : 1.950 Median : 0 Median : 28735 Median : 30500
## Mean : 2.412 Mean : 3135 Mean : 30676 Mean : 33811
## 3rd Qu.: 2.760 3rd Qu.: 0 3rd Qu.: 39006 3rd Qu.: 42654
## Max. :21.700 Max. :57500 Max. :116599 Max. :138500
## NA's :2
## loans_not_cc
## Min. : 0
## 1st Qu.: 0
## Median : 1653
## Mean : 12804
## 3rd Qu.: 19574
## Max. :138500
##
Figure 2
Figure 3
Figure 4
Figure 5
Figure 6
Figure 7
Figure 8
Figure 9
Figure 10
Figure 11
Figure 12
Median Age: Female 38 Male 39
Figure 13
Figure 14
Figure 15
Figure 16
Figure 17
Figure 18
Figure 19
Figure 20
Figure 21
Figure 22
Missing Values
The above output shows missing values for both efc and age features.
We will deal with these missing values using median imputation.
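A minimal sketch of median imputation on toy values:

```python
import numpy as np
import pandas as pd

# Hypothetical series with gaps; the report imputes efc and age this way
df = pd.DataFrame({"efc": [0, 2457, np.nan, 9494, np.nan],
                   "age": [31.0, np.nan, 37.0, 46.0, 29.0]})

for col in ["efc", "age"]:
    # Replace each missing entry with the column's median
    df[col] = df[col].fillna(df[col].median())

print(df.isna().sum())
```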
#Combine degr_cde and school features into one feature.
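A one-line sketch of that combination (hypothetical frame and values; the combined name matches `degrCde_school` seen later):

```python
import pandas as pd

# Toy values following the data dictionary's codes
df = pd.DataFrame({"degr_cde": ["MBA", "MED"], "school": ["SOM", "SOE"]})

# Concatenate degree code and school into one categorical feature
df["degrCde_school"] = (df["degr_cde"] + "_" + df["school"]).astype("category")
print(df["degrCde_school"].tolist())
```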
Before we embark on model building, a crucial preliminary step is pre-processing. This involves splitting our data into training and test sets, as well as transforming numerical and categorical features into formats suitable for classification.
## The Target categories: Index(['Y', 'N'], dtype='object'):
We will separate the data into predictor features and the target feature.
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 9327 entries, 0 to 9326
## Data columns (total 12 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 local_hrs_attempt 9327 non-null float64
## 1 gender 9327 non-null category
## 2 citizen_status 9327 non-null category
## 3 age 9327 non-null int32
## 4 yrs_to_pay_dt 9327 non-null float64
## 5 Race 9327 non-null category
## 6 marital_status 9327 non-null category
## 7 undergrad_loans_cc 9327 non-null float64
## 8 grad_loans_cc 9327 non-null float64
## 9 loans_not_cc 9327 non-null float64
## 10 efc 9327 non-null int32
## 11 degrCde_school 9327 non-null category
## dtypes: category(5), float64(5), int32(2)
## memory usage: 483.0 KB
## None
From the above output we can see that the target feature has been removed.
## <class 'pandas.core.series.Series'>
## RangeIndex: 9327 entries, 0 to 9326
## Series name: loan_status
## Non-Null Count Dtype
## -------------- -----
## 9327 non-null category
## dtypes: category(1)
## memory usage: 9.2 KB
## None
The y output shows only the loan_status feature.
## loan_status
## 0 0
## 1 0
## 2 0
## 3 0
## 4 0
## dtype('int64')
From the two above outputs we can see that the target feature has been converted into binary numerical data type. We will convert the data type back to categorical.
## Target Feature categories as binary: Index([0, 1], dtype='int64'):
## Shape of Predictor Features is (9327, 12):
## Shape of Target Feature is (9327, 1):
## *********** X Structure***********
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 9327 entries, 0 to 9326
## Data columns (total 12 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 local_hrs_attempt 9327 non-null float64
## 1 gender 9327 non-null category
## 2 citizen_status 9327 non-null category
## 3 age 9327 non-null int32
## 4 yrs_to_pay_dt 9327 non-null float64
## 5 Race 9327 non-null category
## 6 marital_status 9327 non-null category
## 7 undergrad_loans_cc 9327 non-null float64
## 8 grad_loans_cc 9327 non-null float64
## 9 loans_not_cc 9327 non-null float64
## 10 efc 9327 non-null int32
## 11 degrCde_school 9327 non-null category
## dtypes: category(5), float64(5), int32(2)
## memory usage: 483.0 KB
The X data frame has 9,327 rows and 12 columns, while the y data frame has 9,327 rows and 1 column.
We will now split X and y into training and test sets.
## Shape of X Train (6995, 12):
## Shape of X Test (2332, 12):
## Shape of y Train (6995, 1):
## Shape of y Test (2332, 1):
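The reported shapes can be reproduced on synthetic stand-in data; `test_size=0.25`, stratification, and the random seed are assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in data: the real X has 9,327 rows and 12 columns
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(9327, 12)))
y = rng.integers(0, 2, size=9327)

# A 25% test fraction yields the 6995/2332 split shown above;
# stratifying preserves the class balance in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

print(X_train.shape, X_test.shape)
```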
The y features will be transformed to NumPy arrays.
## Shape of y Train (6995, 1):
## Shape of y Test (2332, 1):
Next, we will transform the y features into one-dimensional arrays.
## Shape of y Train rv (6995,):
## Shape of y Test rv (2332,):
From the above output we see that y train and y test have been transformed into one-dimensional NumPy arrays.
Our next step involves transforming the predictor features into formats compatible with machine learning.
Numerical feature transformation is achieved through scaling. This process is crucial to prevent a feature with a wide range, such as in the thousands, from being considered more significant than a feature with a narrower range. Scaling ensures that all features hold equal importance before being applied to a machine learning algorithm. Various methods exist for scaling features, and for this analysis, we will use standard scaling. This technique transforms the data to have a zero mean and a variance of one, rendering the data unitless.
Given that most machine learning algorithms only accept numerical features, categorical features in their original form are unacceptable. Therefore, they must be encoded into numerical values, a process known as categorical encoding. In this analysis, we will employ one-hot encoding. This method represents a categorical feature as a set of binary features, one per category. Each binary feature takes the value 1 if the category is present and 0 otherwise.
## Categorical columns are: ['gender', 'citizen_status', 'Race', 'marital_status', 'degrCde_school']
## Numerical columns are: ['local_hrs_attempt', 'age', 'yrs_to_pay_dt', 'undergrad_loans_cc', 'grad_loans_cc', 'loans_not_cc', 'efc']
First, we will create transformed training and test sets for the logistic regression model. This entails dropping the first category of each feature during one-hot encoding.
## ************First Five Rows X_train_lr************
## scale__local_hrs_attempt ... ohe__degrCde_school_PHD_SOE
## 572 -0.005663 ... 0.0
## 3382 0.767671 ... 0.0
## 8832 -0.177515 ... 0.0
## 1304 0.051621 ... 0.0
## 1734 0.166189 ... 0.0
##
## [5 rows x 25 columns]
## count ... max
## scale__local_hrs_attempt 6995.0 ... 5.035331
## scale__age 6995.0 ... 3.423558
## scale__yrs_to_pay_dt 6995.0 ... 9.616802
## scale__undergrad_loans_cc 6995.0 ... 5.684823
## scale__grad_loans_cc 6995.0 ... 4.982445
## scale__loans_not_cc 6995.0 ... 6.141727
## scale__efc 6995.0 ... 1.964071
## ohe__gender_M 6995.0 ... 1.000000
## ohe__citizen_status_eligible_non_citizen 6995.0 ... 1.000000
## ohe__Race_Asian 6995.0 ... 1.000000
## ohe__Race_Hispanic 6995.0 ... 1.000000
## ohe__Race_Other 6995.0 ... 1.000000
## ohe__Race_White 6995.0 ... 1.000000
## ohe__marital_status_married 6995.0 ... 1.000000
## ohe__marital_status_separated 6995.0 ... 1.000000
## ohe__marital_status_single 6995.0 ... 1.000000
## ohe__degrCde_school_CAGS_SOP 6995.0 ... 1.000000
## ohe__degrCde_school_CT_SOE 6995.0 ... 1.000000
## ohe__degrCde_school_CT_SOP 6995.0 ... 1.000000
## ohe__degrCde_school_MBA_SOM 6995.0 ... 1.000000
## ohe__degrCde_school_MED_SOE 6995.0 ... 1.000000
## ohe__degrCde_school_MED_SOP 6995.0 ... 1.000000
## ohe__degrCde_school_MM_SOM 6995.0 ... 1.000000
## ohe__degrCde_school_PHD_NIB 6995.0 ... 1.000000
## ohe__degrCde_school_PHD_SOE 6995.0 ... 1.000000
##
## [25 rows x 8 columns]
## ************First Five Rows X_test_lr************
## scale__local_hrs_attempt ... ohe__degrCde_school_PHD_SOE
## 2851 -1.609616 ... 0.0
## 4926 -0.521219 ... 0.0
## 8628 1.197302 ... 0.0
## 4069 0.968166 ... 0.0
## 7763 -1.323195 ... 0.0
##
## [5 rows x 25 columns]
Upon examining the initial five rows of both the training and test sets, it is evident that the features have undergone transformation while concurrently preserving the column feature names.
## Shape of X Train lr (6995, 25):
## Shape of X Test lr (2332, 25):
Next, we transform training and test sets for all other models. During one-hot encoding, the first category will be dropped only if the feature is binary.
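A variant sketch for this second transformation (toy data; the "num"/"cat" transformer names match the prefixes in the output that follows):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame: gender is binary, marital_status has three categories
X = pd.DataFrame({
    "local_hrs_attempt": [38.0, 45.0, 30.0],
    "gender": ["M", "F", "F"],
    "marital_status": ["single", "married", "divorced"],
})

# drop="if_binary" keeps every category of multi-level features
# but still collapses binary features to a single column
pre_tr = ColumnTransformer([
    ("num", StandardScaler(), ["local_hrs_attempt"]),
    ("cat", OneHotEncoder(drop="if_binary"), ["gender", "marital_status"]),
])
X_tr = pre_tr.fit_transform(X)
print(X_tr.shape)
```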
## ************First Five Rows X_train_tr************
## num__local_hrs_attempt ... cat__degrCde_school_PHD_SOE
## 572 -0.005663 ... 0.0
## 3382 0.767671 ... 0.0
## 8832 -0.177515 ... 0.0
## 1304 0.051621 ... 0.0
## 1734 0.166189 ... 0.0
##
## [5 rows x 28 columns]
## ************First Five Rows X_test_tr************
## num__local_hrs_attempt ... cat__degrCde_school_PHD_SOE
## 2851 -1.609616 ... 0.0
## 4926 -0.521219 ... 0.0
## 8628 1.197302 ... 0.0
## 4069 0.968166 ... 0.0
## 7763 -1.323195 ... 0.0
##
## [5 rows x 28 columns]
## Shape of X Train tr (6995, 28):
## Shape of X Test tr (2332, 28):
In comparing the shape outputs of both the training and test sets, we observe three additional columns compared to the logistic regression transformed data. This is expected: here the first category is dropped only for binary features, so each of the three multi-category features retains one extra column.
From the Loan Status table in the Exploratory Data Analysis section, we can deduce that only 6% of loan borrowers in the dataset have defaulted. It is evident that this dataset is imbalanced: the target class exhibits an uneven distribution of observations, with one class label having a considerably higher number of observations and the other a significantly lower number.

Classes that constitute a large proportion of the dataset are referred to as majority classes, whereas those making up a smaller proportion are minority classes. To address this imbalance, we will oversample the minority class (those who defaulted). The oversampling process will use the SMOTE algorithm, which generates synthetic (resampled) data based on the characteristics of the nearest neighbors.
## Original and Resampled Training Target Feature Categories:
##
## Original Target Feature y Train tr Counter({0: 6608, 1: 387}):
##
## Resampled Target Feature y Train sm lr Counter({0: 6608, 1: 6608}):
After resampling, category 1 (Y) now has the same number of rows as category 0 (N).
## Loan Status categories resampled: (13216, 25):
For the purpose of evaluating model performance, the event of interest for our analysis is a loan status of Y (defaulted). This is considered the positive class.
Classification metrics will determine how well our models predict the event of interest.
Accuracy measures the number of correct predictions as a percentage of the total number of predictions made. For example, if 90% of your predictions are correct, your accuracy is simply 90%. Calculation: (TP + TN) / (TP + TN + FP + FN).

Precision tells us about the quality of positive predictions. The model may not find all the positives, but the ones it does classify as positive are very likely to be correct. For example, out of everyone predicted to have defaulted, how many actually did default? Within everything that has been predicted as positive, precision counts the percentage that is correct. Calculation: TP / (TP + FP).

Recall tells us how well the model identifies true positives. A model may find many positives yet also wrongly flag many observations that are not actually positive. Out of everyone who actually defaulted, how many were correctly identified? Within everything that actually is positive, recall counts how many the model successfully found. A model with low recall is unable to find all (or a large part) of the positive cases in the data. Calculation: TP / (TP + FN).

The F1 score is defined as the harmonic mean of precision and recall. The harmonic mean is an alternative to the more common arithmetic mean and is often useful when averaging rates (https://en.wikipedia.org/wiki/Harmonic_mean). The formula is: F1 = 2 × (Precision × Recall) / (Precision + Recall). Because the F1 score is the harmonic mean of precision and recall, it gives equal weight to both.
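The four formulas can be checked on a toy confusion matrix (the counts below are made up for illustration):

```python
# Hand-computed metrics for a hypothetical confusion matrix,
# with default (Y) as the positive class
tp, fp, fn, tn = 70, 30, 20, 880

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * (precision * recall) / (precision + recall)

print(round(accuracy, 3), round(precision, 3), round(recall, 3), round(f1, 3))
```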
Logistic Regression is a binary classification model that finds the probability or odds ratio of an event. Our model has two events or possible outcomes, yes or no. A probability between 0 and 1 of an observation is produced for both events. Example: Probability of an observation being “Y” is .65 whereas the probability of “N” for the same observation is .35.
If you would like to know more about Logistic Regression check out the link below.
We will partition the training set into equal subsets (folds). These folds are used to assess a model’s performance on training data through cross-validation.

The process works by setting aside the first fold as a test set while the remaining folds form the aggregated training set. The model is trained on the aggregated training set, and performance is evaluated on the held-out fold. This continues until every fold has been held out as a test set once. An evaluation metric is calculated for each iteration and then averaged, resulting in a cross-validated metric.
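The procedure above can be sketched with scikit-learn's `cross_val_score` on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the resampled training set
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# 5-fold CV: each fold is held out once while the other four train the
# model; the five accuracy scores are then averaged
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=5, scoring="accuracy")
print(scores.mean().round(3))
```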
Set grid to find best parameters
Below are the best parameters selected for the logistic regression model.
LogisticRegression(C=0.1, max_iter=2000, solver='liblinear')
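A hypothetical sketch of the grid search (the candidate grid values are assumptions; only the winning parameters are known from the output above):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data
X, y = make_classification(n_samples=400, n_features=6, random_state=0)

# Each grid combination is scored by cross-validation; the best is kept
param_grid = {"C": [0.01, 0.1, 1, 10], "solver": ["liblinear"]}
grid = GridSearchCV(LogisticRegression(max_iter=2000), param_grid,
                    cv=5, scoring="roc_auc")
grid.fit(X, y)
print(grid.best_params_)
```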
## Average Training Accuracy Score: (0.656)
## Average Training AUC Score: (0.697)
## ****Logistic Regression Validation Classification Report****
## precision recall f1-score support
##
## 0 0.97 0.57 0.72 2202
## 1 0.08 0.66 0.15 130
##
## accuracy 0.57 2332
## macro avg 0.52 0.62 0.43 2332
## weighted avg 0.92 0.57 0.68 2332
## Area Under the Curve Score-Log Reg Test Set: (0.615)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics

# Confusion matrix on the test set, with counts and percentages as annotations
cm_lr_test = metrics.confusion_matrix(y_test_rv, y_pred_lr, labels=[0, 1])
df_cm_lr_test = pd.DataFrame(cm_lr_test,
                             index=["Actual - No", "Actual - Yes"],
                             columns=["Predicted - No", "Predicted - Yes"])
group_counts = ["{0:0.0f}".format(value) for value in cm_lr_test.flatten()]
group_percentages = ["{0:.2%}".format(value)
                     for value in cm_lr_test.flatten() / np.sum(cm_lr_test)]
labels = [f"{v1}\n{v2}" for v1, v2 in zip(group_counts, group_percentages)]
labels = np.asarray(labels).reshape(2, 2)

plt.figure(figsize=(11, 8))
sns.heatmap(df_cm_lr_test, annot=labels, fmt='')
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.title("Confusion Matrix-Logistic Regression", fontsize=14)
Figure 23
For the logistic regression model we took the absolute value of the coefficients so as to capture the importance of features with both negative and positive effects.
Now that we have the importance of the features we will now transform the coefficients for easier interpretation. The coefficients are in log odds format. We will transform them to odds-ratio format.
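The transformation is just exponentiation of the log-odds coefficients; a sketch with hypothetical coefficient values (the first feature name matches the output below):

```python
import numpy as np
import pandas as pd

# Hypothetical coefficients in log-odds; exponentiating gives odds ratios
coefs = pd.DataFrame({
    "Feature": ["ohe__degrCde_school_MM_SOM", "scale__loans_not_cc"],
    "Coefficient": [0.8210, 0.1592],
})
coefs["Exp_Coefficient"] = np.exp(coefs["Coefficient"])
print(coefs)
```

An odds ratio above 1 means the feature increases the odds of default; below 1 means it decreases them.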
## ******************Top Five Coefficients******************
## Feature Exp_Coefficient
## 22 ohe__degrCde_school_MM_SOM 2.272679
## 5 scale__loans_not_cc 1.172586
## 21 ohe__degrCde_school_MED_SOP 1.125975
## 20 ohe__degrCde_school_MED_SOE 1.083622
## 2 scale__yrs_to_pay_dt 1.083519
The logistic regression model did not perform well based on the cross-validation scores. The accuracy score of 0.656 translates to correctly classifying an observation only about 66% of the time. The AUC score of 0.697 places the model in the average category.
Metrics for model performance on the test set (unseen data) were worse. The accuracy score of 0.57 means the model correctly classifies an observation only 57% of the time, while the AUC score of 0.615 categorizes the performance as below average. Supporting this assessment are the recall and precision scores from the classification report: recall indicates that the model captures approximately six out of ten observations of the event of interest (Y), while precision indicates that fewer than one in ten observations predicted as defaults is actually a default.
## Original and Resampled Training Target Feature Categories:
##
## Original Target Feature y Train tr Counter({0: 6608, 1: 387}):
##
## Resampled Target Feature y Train sm lr Counter({0: 6608, 1: 6608}):
Below are the best parameters chosen for the linear SVC model.
SVC(C=10, coef0=1, gamma=0.0001, kernel='linear')
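The reported best parameters suggest a cross-validated grid search of roughly the following shape. The grid values and synthetic data below are illustrative assumptions, not the search actually run:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the resampled training set.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Hypothetical grid covering the reported winners (C=10, gamma=0.0001, coef0=1).
param_grid = {"C": [0.1, 1, 10], "gamma": [1e-4, 1e-3, 1e-2], "coef0": [0, 1]}
grid = GridSearchCV(SVC(kernel="linear"), param_grid,
                    scoring="roc_auc", cv=5, n_jobs=-1)
grid.fit(X, y)
print(grid.best_params_)
```

Note that `gamma` and `coef0` have no effect on a purely linear kernel; they appear in the fitted repr above simply because they were part of the searched parameter space.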
## Average Training Accuracy SVC Score: (0.694)
## Average Training Area Under the Curve SVC Score: (0.751)
## ****Support Vector Machines Test Classification Report****
## precision recall f1-score support
##
## 0 0.96 0.64 0.76 2202
## 1 0.08 0.54 0.14 130
##
## accuracy 0.63 2332
## macro avg 0.52 0.59 0.45 2332
## weighted avg 0.91 0.63 0.73 2332
## Average Test Set Area Under the Curve SVC Score: (0.587)
cm_svc_test = metrics.confusion_matrix(y_test_rv, y_pred_svc, labels=[0,1])
df_cm_svc_test = pd.DataFrame(cm_svc_test, index=["Actual - No", "Actual - Yes"], columns=["Predicted - No", "Predicted - Yes"])
group_counts = ["{0:0.0f}".format(value) for value in cm_svc_test.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in cm_svc_test.flatten()/np.sum(cm_svc_test)]
labels = [f"{v1}\n{v2}" for v1, v2 in zip(group_counts, group_percentages)]
labels = np.asarray(labels).reshape(2,2)
plt.figure(figsize=(9,6))
sns.heatmap(df_cm_svc_test, annot=labels, fmt='')
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.title("Confusion Matrix-Support Vector Machine ", fontsize=11)
Figure 24
## ******************Top Five Coefficients******************
## Feature Exp_Coefficient
## 22 ohe__degrCde_school_MM_SOM 2.167627
## 10 ohe__Race_Hispanic 1.947730
## 1 scale__age 1.287697
## 5 scale__loans_not_cc 1.217136
## 2 scale__yrs_to_pay_dt 1.157403
Below are the best parameters chosen for the non-linear SVC model.
SVC(C=1, coef0=0, degree=1, gamma=1, probability=True)
## Average Training Accuracy Score: (0.984)
## Average Training Area Under the Curve Score: (0.996)
## ****Support Vector Machines (RBF) Test Classification Report****
## precision recall f1-score support
##
## 0 0.98 0.99 0.98 2202
## 1 0.73 0.61 0.66 130
##
## accuracy 0.97 2332
## macro avg 0.85 0.80 0.82 2332
## weighted avg 0.96 0.97 0.96 2332
## Average Test Set Area Under the Curve Score: (0.797)
cm_rbf_test = metrics.confusion_matrix(y_test_rv, y_pred_rbf, labels=[0,1])
df_cm_rbf_test = pd.DataFrame(cm_rbf_test, index=["Actual - No", "Actual - Yes"], columns=["Predicted - No", "Predicted - Yes"])
group_counts = ["{0:0.0f}".format(value) for value in cm_rbf_test.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in cm_rbf_test.flatten()/np.sum(cm_rbf_test)]
labels = [f"{v1}\n{v2}" for v1, v2 in zip(group_counts, group_percentages)]
labels = np.asarray(labels).reshape(2,2)
plt.figure(figsize=(9,6))
sns.heatmap(df_cm_rbf_test, annot=labels, fmt='')
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.title("Confusion Matrix-Support Vector Machine (RBF)", fontsize=10)
Figure 25
The linear SVC model’s training scores show a slight improvement over the logistic regression model’s scores. With an accuracy of 0.69, it classifies observations correctly at almost 70%. The AUC score is slightly above average. However, when we examine the test metrics, the potential promise of this model comes crashing down. The accuracy drops to 0.59, and the test AUC score is even lower at 0.58, making this model just marginally better than flipping a coin. The linear SVC model’s scores could be affected by the non-linearity of the dataset.
Few datasets are perfectly linear. As such, we tried a non-linear SVC model. The non-linear SVC model shows a significant improvement over its linear counterpart in training scores, with accuracy at 0.984 and AUC at 0.99. These scores categorize the non-linear SVC as a very good model. Checking the test metrics, the accuracy is slightly lower at 0.97. The AUC score drops by 0.20, coming in at 0.80. Although the AUC score is significantly lower, we can still classify it as good. Now, let’s assess how well the non-linear model classifies the event of interest (Y). The recall score tells us that the non-linear model captures six out of ten “Y” classes, and precision informs us that seven out of ten observations classified as “Y” are accurate. This represents a definite improvement over our previous models, although it still hovers around average. The high accuracy score is influenced by the model’s superior ability to classify the “N” class.
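The effect of non-linearity on a linear SVC can be seen on a small synthetic example. On concentric-circle data, which no straight line can separate, the RBF kernel recovers what the linear kernel cannot (this toy demo is an illustration of the principle, not the report's data):

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Two concentric rings: radially separable, not linearly separable.
X, y = make_circles(n_samples=400, noise=0.1, factor=0.4, random_state=0)

lin = cross_val_score(SVC(kernel="linear"), X, y, cv=5).mean()
rbf = cross_val_score(SVC(kernel="rbf"), X, y, cv=5).mean()
print(f"linear: {lin:.2f}  rbf: {rbf:.2f}")  # rbf scores far higher
```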
## Original and Resampled Training Target Feature Categories:
##
## Original Target Feature y Train tr Counter({0: 6608, 1: 387}):
##
## Resampled Target Feature y Train sm tr Counter({0: 6608, 1: 6608}):
Below are the best parameters chosen for the xgboost model.
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=0.9, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=0.1, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=0.2, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=15, max_leaves=None,
              min_child_weight=5, missing=nan, monotone_constraints=None,
              multi_strategy=None, n_estimators=1000, n_jobs=-1,
              num_parallel_tree=None, random_state=None, ...)
## Average Training Accuracy XGBoost Score: (0.982)
## Average Training Area Under the Curve Score: (0.997)
## ****XGBoost Test Classification Report****
## precision recall f1-score support
##
## 0 0.98 0.99 0.98 2202
## 1 0.74 0.61 0.67 130
##
## accuracy 0.97 2332
## macro avg 0.86 0.80 0.82 2332
## weighted avg 0.96 0.97 0.96 2332
## Average XGBoost Test Set Area Under the Curve Score: (0.797)
cm_xg_test = metrics.confusion_matrix(y_test_rv, y_pred_xg, labels=[0,1])
df_cm_xg_test = pd.DataFrame(cm_xg_test, index=["Actual - No", "Actual - Yes"], columns=["Predicted - No", "Predicted - Yes"])
group_counts = ["{0:0.0f}".format(value) for value in cm_xg_test.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in cm_xg_test.flatten()/np.sum(cm_xg_test)]
labels = [f"{v1}\n{v2}" for v1, v2 in zip(group_counts, group_percentages)]
labels = np.asarray(labels).reshape(2,2)
plt.figure(figsize=(9,6))
sns.heatmap(df_cm_xg_test, annot=labels, fmt='')
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.title("Confusion Matrix-XGBoost", fontsize=14)
Figure 26
Figure 27
Below are the best parameters chosen for the gradient boost model.
GradientBoostingClassifier(max_depth=15, max_features=9, min_samples_leaf=60,
                           min_samples_split=1000, n_estimators=4000,
                           subsample=0.7, warm_start=True)
## Average Training Accuracy Gradient Boost Score: (0.983)
## Average Training Area Under the Curve-GradientBoost Score: (0.997)
## ****Gradient Boosting Classification Report****
## precision recall f1-score support
##
## 0 0.98 0.99 0.98 2202
## 1 0.78 0.62 0.69 130
##
## accuracy 0.97 2332
## macro avg 0.88 0.80 0.84 2332
## weighted avg 0.97 0.97 0.97 2332
## Average Test Set Area Under The Curve- Gradient Boost Score: (0.802)
cm_gb_test = metrics.confusion_matrix(y_test_rv, y_pred_gb, labels=[0,1])
df_cm_gb_test = pd.DataFrame(cm_gb_test, index=["Actual - No", "Actual - Yes"], columns=["Predicted - No", "Predicted - Yes"])
group_counts = ["{0:0.0f}".format(value) for value in cm_gb_test.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in cm_gb_test.flatten()/np.sum(cm_gb_test)]
labels = [f"{v1}\n{v2}" for v1, v2 in zip(group_counts, group_percentages)]
labels = np.asarray(labels).reshape(2,2)
plt.figure(figsize=(9,6))
sns.heatmap(df_cm_gb_test, annot=labels, fmt='')
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.title("Confusion Matrix-Gradient Boost", fontsize=14)
Figure 28
Figure 29
Below are the best parameters chosen for the AdaBoost model.
AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=6),
                   learning_rate=0.5, n_estimators=1000, random_state=1)
## Average Training Accuracy Score-Ada Boost: (0.983)
## Average Training Area Under the Curve Score-Ada Boost: (0.982)
## ****AdaBoost Validation Classification Report****
## precision recall f1-score support
##
## 0 0.98 0.99 0.98 2202
## 1 0.83 0.61 0.70 130
##
## accuracy 0.97 2332
## macro avg 0.90 0.80 0.84 2332
## weighted avg 0.97 0.97 0.97 2332
## Average Test Set Area Under the Curve Score-Ada Boost : (0.800)
cm_adb_vl = metrics.confusion_matrix(y_test_rv, y_pred_adb, labels=[0,1])
df_cm_adb_vl = pd.DataFrame(cm_adb_vl, index=["Actual - No", "Actual - Yes"], columns=["Predicted - No", "Predicted - Yes"])
group_counts = ["{0:0.0f}".format(value) for value in cm_adb_vl.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in cm_adb_vl.flatten()/np.sum(cm_adb_vl)]
labels = [f"{v1}\n{v2}" for v1, v2 in zip(group_counts, group_percentages)]
labels = np.asarray(labels).reshape(2,2)
plt.figure(figsize=(9,9))
sns.heatmap(df_cm_adb_vl, annot=labels, fmt='')
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.title("Confusion Matrix-AdaBoost", fontsize=9)
Figure 30
Figure 31
Gower Distance
Select features
Figure 31
Based on the elbow plot (Figure 31), we will select six clusters for our cluster generation.
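The Gower computation and clustering were done in R, presumably via the usual route (cluster::daisy with a PAM-style partitioner). As a minimal Python sketch of the Gower distance itself for mixed data, with a toy frame and illustrative column names:

```python
import numpy as np
import pandas as pd

def gower_matrix(df, num_cols, cat_cols):
    """Pairwise Gower distance: range-normalized absolute difference for
    numeric features, simple 0/1 mismatch for categorical features,
    averaged over all features."""
    n = len(df)
    d = np.zeros((n, n))
    for c in num_cols:
        x = df[c].to_numpy(dtype=float)
        rng = np.ptp(x) or 1.0  # guard against zero range
        d += np.abs(x[:, None] - x[None, :]) / rng
    for c in cat_cols:
        x = df[c].to_numpy()
        d += (x[:, None] != x[None, :]).astype(float)
    return d / (len(num_cols) + len(cat_cols))

# Toy frame with one numeric and one categorical feature (illustrative only).
toy = pd.DataFrame({"total_loans": [0.0, 10000.0, 5000.0],
                    "school": ["SOE", "SOM", "SOE"]})
D = gower_matrix(toy, ["total_loans"], ["school"])
print(D.round(2))
```

The resulting dissimilarity matrix can then be passed to a k-medoids/PAM routine as a precomputed metric, matching the six-cluster choice from the elbow plot.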
Cluster Dimensions
Figure 32
Figure 33
Figure 34
Figure 35
Figure 36
Figure 37
Figure 38
Figure 39
Figure 40
Figure 41
Figure 42
Figure 43
Figure 44
Figure 45
Figure 46
Figure 47
Figure 48
Figure 49
Figure 50
Figure 51
Figure 52
Figure 53
Figure 54
saveRDS(degree_clust_2_perc, "degree_clust_2_perc.rds")
Figure 55
Figure 56
Figure 57
Figure 58
Figure 59
Figure 60
Figure 61
Figure 62
Figure 63
Figure 64
Figure 65
Figure 66
Figure 67
Figure 68
Figure 69
Figure 70
Figure 71
Figure 72
Figure 73
Figure 74
Figure 75
Figure 76
Figure 77
Figure 78
Figure 79
From Figure 34, we observe the default percentage of each cluster.
Remember, from Table 12, the default percentage for graduate students is
6%. Clusters 3, 5, and 6 are above this percentage, and as such, we’ll
focus on these clusters to find what stands out.
Cluster 3 has the highest default percentage at 9.53%. The median
total loans borrowed at CC (Figure 38) are the lowest of all clusters,
while the median total loans borrowed at other colleges (Figure 40) are
the highest at 16,741. The median loans borrowed are striking, as except
for clusters 3 and 6, all other clusters have a zero median. The higher
median of loans borrowed at other colleges is why Cluster 3’s median
total loans borrowed (Figure 41) is the second highest at 46,892. The
median EFC is 50, placing this group as high need. Attempted and earned
credits (Figures 43 and 44) are low at 15 and 11, respectively. This is
possibly due to students in this group exiting sooner in their
enrollment, as other clusters have medians of 30 or greater. This is
supported by Figure 45, which shows the median exit time is less than a
year, and Figure 60, which has no students in this group completing
their degree. Another striking feature is the race distribution (Figure
58), which shows that just under three-quarters of Cluster 3 is African
American.
Cluster 5 has the second-highest default percentage at 7.33% (Figure
34). The median EFC for this group, just like Cluster 3's, places this
group as high need. We find from Figure 70 that 97% of this group is
Hispanic, the majority of which are from Puerto Rico, Lawrence, and
Springfield.
Cluster 6’s default percentage is only slightly higher than the
population percentage at 6.21% (Figure 34). Its median loans borrowed at
CC are 37,484 (Figure 38), only slightly higher than Cluster 4. Median
loans borrowed at other colleges (Figure 40) are 12,871. The median
total loans borrowed (Figure 41) are 57,499, 10,000 greater than the
next closest cluster. The median EFC (Figure 42) is 188, and African
Americans make up 98% of this group.
Our clusters can help glean more information supporting feature importance. Specifically, we'll focus on Clusters 3, 5, and 6, as these clusters have default rates above 6%.
These clusters all have median EFCs below 200, which places them as especially financially needy populations.
Clusters 3 and 6 have non-CC loan medians above 6,000, while Cluster 6 also has a median of 56,000 in graduate loans borrowed at CC.
Delving further, we find Clusters 3 and 6 have populations that are majority African American, while Cluster 5 is majority Hispanic.
Finally, from Cluster 3 we find the entire population withdrew or unofficially withdrew.
Action Plan:
An individual loan entrance counseling plan will be designed for students entering with loan balances from previous colleges and EFCs below 200. Loan entrance counseling for Hispanic students will be provided in Spanish.
Students withdrawing from the college will be contacted immediately and provided information on a third-party loan counseling center for help with repayment options.
Though all students are provided loan exit counseling, those with total loan debt over 50,000 will be contacted individually and given contact information for a third-party loan counseling service.
Let’s check on how our segmentation of this population can focus the details. Firstly, Clusters 3, 5, and 6 all had median EFCs under two hundred. Secondly, Clusters 3 and 6’s populations are mostly African American, and Cluster 5’s population is mostly Hispanic. Thirdly, Clusters 3 and 6 have high median loans borrowed at other institutions; Cluster 6 additionally has the highest median of graduate loans borrowed at CC.
Personalized entrance counseling plans will be designed for African American students with EFCs under 200 who enter with previous loan debt. Hispanic students with EFCs under 200 will have entrance counseling designed in Spanish to help guide them in borrowing in their first language. Though withdrawal is most significant in Cluster 3, we will reach out to all students withdrawing without a degree and provide them third-party loan counseling to help them plan for their loan payments. Students borrowing over 50,000 will be encouraged to utilize our third-party loan counseling service to understand the best payment options and their rights.